# Sequence Parallel (SP)

Sequence Parallel splits long sequences across multiple GPUs along the sequence dimension, enabling training with sequence lengths that exceed single-GPU memory. Twinkle implements Ulysses-style sequence parallel with optional derived ring attention.

## Overview

| Concept | Description |
|---------|-------------|
| **SequenceParallelConfig** | Configuration dataclass for SP |
| **SequenceParallelStrategy** | Strategy class that wraps SP lifecycle |
| **SequenceParallel** | Core implementation handling pad/split/gather |

## Configuration

```python
from twinkle.model.transformers.strategy.sequence_parallel import SequenceParallelConfig

config = SequenceParallelConfig(
    enabled=True,           # Enable sequence parallel
    ulysses_size=None,      # Ulysses SP degree (auto-derived from DeviceMesh if None)
    gather_logits=True,     # Gather logits after forward for loss computation
)
```

## Usage with DeviceMesh

SP is activated by setting `ulysses_size` in `DeviceMesh.from_sizes()`:

```python
from twinkle.utils import DeviceMesh

# 8 GPUs: 4-way Ulysses SP × 2-way data parallel
device_mesh = DeviceMesh.from_sizes(
    world_size=8,
    dp_size=2,
    ulysses_size=4,
)
```

## How It Works

1. **Pad** — input sequences are padded to a length divisible by SP world size
2. **Split** — padded inputs are evenly split across SP ranks along the sequence dimension
3. **Distributed Attention** — FlashAttention2 is patched to perform Ulysses all-to-all communication before/after attention computation
4. **Gather** — after forward, logits are gathered back to full sequence length for loss computation

## Supported Attention Backends

| Backend | Status |
|---------|--------|
| FlashAttention2 | Fully supported (including packed/padding-free sequences) |
| SDPA | Supported (non-packed batches only) |
| Derived Ring Attention | Supported with FlashAttention2 only (`rp_world_size > 1`) |

## Qwen3.5 Linear Attention

SP automatically detects Qwen3.5 GatedDeltaNet linear attention layers and applies the `Qwen3_5GatedDeltaNetUlyssesPatch` for correct sequence-parallel behavior on hybrid attention architectures.

## MoE Auxiliary Loss

For MoE models, SP automatically installs a forward hook that gathers router logits across SP ranks before auxiliary loss computation, ensuring correct load-balancing signals.

## Key Constraints

- `num_key_value_heads` must be divisible by `ulysses_size` (for Ulysses) or use ring attention fallback
- Packed/padding-free batches require FlashAttention2
- Derived ring attention requires `batch_size == 1` (packed format)
- `torch.distributed` must be initialized